Verification labels for rovibronic quantum-state energy uncertainties

Transition wavenumbers contained in line-by-line rovibronic databases can be compromised by errors of various nature. When left undetected, these errors may result in incorrect quantum-state energies, potentially compromising a large number of derived spectroscopic data. Spectroscopic networks treat the complete set of line-by-line spectroscopic data as a large graph, and through a least-squares refinement the measured line positions are converted into empirical quantum-state energies. Spectroscopic networks also offer a highly useful framework to develop mathematical tools helping to identify possible errors and conflicts within the dataset. For example, wavenumber errors can be detected by checking for violations of the law of energy conservation. This paper describes a new graph-theory tool, which results in so-called verification labels for the quantum states. Verification labels help to express the vulnerability of a calculated empirical energy value and its uncertainty against possible wavenumber errors, providing complementary information to simple statistical uncertainties.

the creation of various mathematical tools aiding the detection, and subsequent correction, of data issues in high-resolution molecular spectroscopy [5-11]. The section "Verification of line lists" elaborates on the relationship between the consistency and the correctness of line-by-line spectroscopic databases. The section "Labeling scheme" introduces a new labeling scheme. The verification label of quantum state X is based primarily on the verification metric, V(X), and secondarily on a graph property of X, both defined within this section. The section "A practical example: the W2020 database of water transitions" demonstrates verification labeling on the example of the W2020 line-by-line spectroscopic database [12] of the H₂¹⁶O molecule. The section "Conclusions" contains the conclusions of the paper.

Theoretical background
For convenience, Table 1 highlights the most important symbols and terms used in this section and the rest of the paper.

Correctness of wavenumber entries
Let us index the transitions of a line list L with i, and denote the 'true' transition wavenumber, that is, the line center position, of the ith transition by w_i. The unknown w_i is estimated by the wavenumber value in the line list, denoted by ŵ_i. The wavenumber ŵ_i is reported together with a measurement uncertainty u_i. Let Ŵ_i = (ŵ_i − u_i, ŵ_i + u_i) be the wavenumber interval of the ith transition. The wavenumber interval should include the w_i value with a probability of at least 95%; in other words, P(w_i ∈ Ŵ_i) ≥ 0.95. This is in accordance with the convention of reporting uncertainties at the 2σ level, where σ denotes the standard deviation. Let us call a wavenumber interval Ŵ_i for which w_i ∉ Ŵ_i an incorrect wavenumber interval. Since the w_i values are not known, it is not straightforward to ascertain whether Ŵ_i is correct or not. However, one could take external information related to Ŵ_i into consideration and come up with a decision whether to consider Ŵ_i correct or incorrect.
A trivial example of an incorrect wavenumber interval Ŵ_i is when ŵ_i + u_i < 0, as wavenumber values must be positive reals. Spectroscopic information systems, for example, those based on the MARVEL (Measured Active Rotational Vibrational Energy Levels) technique [13-15], use several advanced supporting methods to assess the correctness of the wavenumber intervals of a line list.
Some of these methods are based on the graph representation of high-resolution rovibronic spectroscopic data, called a spectroscopic network [16]. The new labeling introduced in this paper also relies on this representation. Thus, let us continue by covering the required theory about spectroscopic networks.

Spectroscopic networks
Spectroscopic networks offer a highly useful representation of line-by-line spectroscopic data, especially when they come from a large number of sources of different origin and of different accuracy. The spectroscopic network of a molecule is a graph G(V, E), in which the vertex set V represents the rovibronic quantum states of the molecule, and the edge set E corresponds to allowed transitions between the quantum states. Certain physical quantities can be utilized as weight functions; most notably, quantum-state energies as vertex weights, and transition intensities and wavenumbers as edge weights. The term 'spectroscopic network' is not an exact definition of a graph: it has to be specified, based on the given application, which weights to use, or, for example, whether it is defined to be a directed or an undirected graph.
Figure 1 depicts a small spectroscopic network (SN), which has only four quantum states and four transitions among the states. The blue numbers, outside of the graph, represent transition wavenumbers, while the red numbers, inside of the graph, are the respective transition uncertainties, both in units of cm⁻¹. We will continue referring to Fig. 1 as more definitions and SN properties are introduced.
Table 1 entries (symbol: meaning):
u_i: the uncertainty value of ŵ_i in the line list.
Ŵ_i: the wavenumber interval ŵ_i ± u_i.
w′_i: a wavenumber value based on all Ŵ_i intervals of the line list; the w′_i values are selected to yield zero-sum cycles.
W′: the set of the w′_i values of a line list.
u′_i: uncertainty value, based on u_i, subject to increase to achieve consistency of the line list.
E(X): energy value of quantum state X.
U(X): uncertainty of quantum state X.
P_X: the set of the edges of the shortest path from the root to X using the u′_i edge weights.

The size of an SN depends on the underlying spectroscopic data set. While one can construct a network with only a few quantum states and a few transitions among them, like in Fig. 1, usual applications of SNs are characterized by inputs of large size. For example, the H₂¹⁶O line list in the HITRAN spectroscopic information system [1] has 319,887 lines (transitions) that span 14,130 quantum states. Therefore, one of the main challenges of designing graph algorithms for application in spectroscopy comes from the sheer number of vertices and edges. For example, in practice, it is not recommended to use the adjacency matrix representation of SNs. Adjacency matrices have a size of |V| × |V|, where the large |V| values involved in the typical calculations make even storing the matrix challenging, if not impossible. It is advised to use the adjacency list representation instead, that is, storing the neighbours of each vertex in a list, yielding a much smaller data structure.
Spectroscopic networks can be defined either as directed or undirected graphs. In the directed case, edges are directed from the lower-energy quantum state of the transition towards the higher-energy quantum state (the transition occurs in absorption). In Fig. 1, for example, the directed edge e_AB from A to B corresponds to a transition from the lower-energy quantum state A to the higher-energy quantum state B. The quantum state of the molecule that is defined to have the zero energy value is the root of the spectroscopic network.
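The adjacency list representation recommended above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the line-list values are hypothetical except ŵ_AB = 10.0000 and u_AB = 0.0001 (quoted later in the text) and u_BC = 0.0010 (implied by the A → B → C path length of 0.0011); the names `LINES` and `build_adjacency` are invented for this sketch.

```python
from collections import defaultdict

# Hypothetical line list for the four-state network of Fig. 1:
# (lower state, upper state, wavenumber, uncertainty), all in cm^-1.
# Only the A-B entry and u_BC are tied to numbers quoted in the text;
# the remaining values are assumptions chosen to close the cycle.
LINES = [
    ("A", "B", 10.0, 0.0001),
    ("B", "C", 20.0, 0.0010),
    ("A", "D", 12.0, 0.0041),
    ("D", "C", 18.0, 0.0100),
]

def build_adjacency(lines):
    """Map each state to a list of (neighbour, line index, direction) triples.

    direction is +1 when the line is traversed lower -> upper and -1 otherwise.
    Parallel edges (repeated measurements) simply become extra list entries.
    """
    adj = defaultdict(list)
    for idx, (lower, upper, _w, _u) in enumerate(lines):
        adj[lower].append((upper, idx, +1))
        adj[upper].append((lower, idx, -1))
    return adj

adj = build_adjacency(LINES)
print(sorted(n for n, _, _ in adj["A"]))  # ['B', 'D']
```

The memory footprint grows with |V| + |E| rather than |V| × |V|, which is the point of the recommendation for networks with tens of thousands of states.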
A line list may contain multiple lines of the same transition; for example, if multiple measurements are available.These are represented by parallel edges in the SN.
For a graph G(V, E), a path P ⊆ E of length k − 1 is an edge set {e_1, e_2, ..., e_{k−1}} ⊆ E for which there exists a vertex set {v_1, ..., v_k} ⊆ V such that for 1 ≤ i ≤ k − 1 the endpoints of e_i are v_i and v_{i+1}. In this paper, the direction of the edges in a path is defined to be irrelevant; in Fig. 1, there are two paths from A to D: one is {e_AD} and the other is {e_AB, e_BC, e_DC}. If edge weights are considered, then a shortest path between two vertices is a path with the smallest sum of edge weights. For example, the shortest path in Fig. 1 between vertices A and C, using the uncertainties as weights, is A → B → C, with a weight sum of 0.0011.
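The shortest-path example can be reproduced with a small Dijkstra sketch. The uncertainties u_AB = 0.0001 and u_BC = 0.0010 follow from the numbers in the text; the values for the A-D and D-C edges are an assumed split of their combined uncertainty, and the function name `shortest_path` is invented here.

```python
import heapq

# Uncertainties (cm^-1) used as undirected edge weights for Fig. 1.
# The A-D / D-C values are an assumption; only their sum matters below.
UNC = {("A", "B"): 0.0001, ("B", "C"): 0.0010,
       ("A", "D"): 0.0041, ("D", "C"): 0.0100}

def shortest_path(unc, source, target):
    """Dijkstra's algorithm; returns (total weight, list of vertices)."""
    adj = {}
    for (x, y), u in unc.items():
        adj.setdefault(x, []).append((y, u))
        adj.setdefault(y, []).append((x, u))
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        if v == target:
            break
        for nxt, u in adj.get(v, []):
            nd = d + u
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = v
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

length, path = shortest_path(UNC, "A", "C")
print(path, round(length, 4))  # ['A', 'B', 'C'] 0.0011
```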
A cycle is a path whose first and last vertices coincide (v_1 = v_k). If the edge e_i does not participate in any cycles in the graph, then it is called a bridge [9].
A graph is 2-edge-connected if there exist at least two edge-disjoint paths between any two of its vertices (i.e., at least two paths such that no edge appears in both paths). For a graph G(V, E) with a root vertex, let us denote the maximal 2-edge-connected subgraph that contains the root by G′(V′, E′). Note that V′ ⊆ V, E′ ⊆ E, the edge set E′ does not contain any bridges, and any edge in E′ participates in at least one cycle in G′.
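As an illustration of these definitions, bridges can be located with a depth-first search using low-link values, and V′ is then the set of vertices reachable from the root without crossing a bridge. The sketch below is a generic textbook approach, not necessarily the procedure used by the authors; the pendant state E and the function names are hypothetical. The recursive search is adequate for small examples only, and an iterative variant would be needed for networks of realistic size.

```python
def find_bridges(edges):
    """Return the indices of the bridge edges of an undirected multigraph.

    edges is a list of (x, y) vertex pairs. Parallel edges are allowed and are
    never reported as bridges, since each copy closes a cycle with the other.
    """
    adj = {}
    for idx, (x, y) in enumerate(edges):
        adj.setdefault(x, []).append((y, idx))
        adj.setdefault(y, []).append((x, idx))
    order, low, bridges = {}, {}, set()

    def dfs(v, parent_edge):
        order[v] = low[v] = len(order)
        for nxt, idx in adj[v]:
            if idx == parent_edge:
                continue  # do not reuse the edge we arrived on
            if nxt in order:
                low[v] = min(low[v], order[nxt])
            else:
                dfs(nxt, idx)
                low[v] = min(low[v], low[nxt])
                if low[nxt] > order[v]:
                    bridges.add(idx)  # no back edge bypasses this edge

    for v in adj:
        if v not in order:
            dfs(v, None)
    return bridges

def root_component(edges, bridges, root):
    """Vertices reachable from the root without crossing a bridge (the set V')."""
    seen, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for idx, (x, y) in enumerate(edges):
            if idx in bridges or v not in (x, y):
                continue
            other = y if v == x else x
            if other not in seen:
                seen.add(other)
                stack.append(other)
    return seen

# The four-edge cycle of Fig. 1 plus a hypothetical pendant state E attached to C.
EDGES = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
bridges = find_bridges(EDGES)
print(sorted(bridges))                                # [4]
print(sorted(root_component(EDGES, bridges, "A")))    # ['A', 'B', 'C', 'D']
```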
Almost without exception, SNs based on experimental data are bipartite graphs, a result of the standard rovibronic selection rules governing transitions among the quantum states [5]. Consequently, the number of edges of any cycle of the SN must be even. More explicitly, the smallest cycle in a spectroscopic network formed by dipole-allowed one-photon transitions has four edges (explaining the choice for Fig. 1).

Law of energy conservation
An important property of a directed spectroscopic network is that the sum of the w_i wavenumbers along each cycle of the graph, with the weights that are travelled backwards counted as negative, is equal to zero. This property of the cycles of SNs is guaranteed by the quantum nature of the transitions, embodied in the law of energy conservation [8,11].
Recall that in a line list we do not have the unknown w_i values, only the Ŵ_i wavenumber intervals, and the wavenumber interval is incorrect when w_i ∉ Ŵ_i. Thus, the use of the law of energy conservation in this environment is as follows: one should be able to select a wavenumber from each Ŵ_i interval such that, using these wavenumbers, the aforementioned sum along all cycles is zero. If such a wavenumber selection is not possible, then the line list contains at least one incorrect wavenumber. The reverse, however, is not true: even if such a wavenumber selection exists, there could still be incorrect wavenumbers present in the line list.
The law of energy conservation plays a pivotal role in investigations revealing incorrect wavenumber intervals in a line list. It allows the wavenumber intervals of multiple lines to be compared with each other, providing an excellent external information source for deciding about the correctness of a wavenumber interval.
To formalize this concept, let a wavenumber selection function, f(L) = W′, where L is a line list, define a set of wavenumbers W′ = {w′_1, w′_2, ...} for all edges of the spectroscopic network such that for all cycles in the graph, the sum of the w′_i wavenumbers used as edge weights, with the weights that are travelled backwards counted as negative, is equal to zero. Additionally, for any wavenumber selection W′, it is required that w′_i = w′_j whenever e_i and e_j are parallel edges (i.e., their endpoints are the same). Note that the variable of the function f is not the set of all Ŵ_i wavenumber intervals in the line list but the line list itself; this definition allows the use of any kind of information contained in the line list in the selection of the w′_i values. If there exists a wavenumber selection W′ such that ∀w′_i ∈ W′: w′_i ∈ Ŵ_i, that is, if all selected wavenumbers lie in their corresponding wavenumber intervals, both the wavenumber selection and the line list are called consistent. Otherwise, if for a line list no consistent wavenumber selection exists, the line list is called inconsistent.
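One generic way to test whether a consistent wavenumber selection exists is to treat each line as a pair of difference constraints on the state energies and to look for a negative cycle with the Bellman-Ford algorithm. The sketch below is an illustration of that idea and not the MARVEL procedure; it assumes closed intervals [ŵ_i − u_i, ŵ_i + u_i] (the text uses open ones), a small tolerance `eps`, and hypothetical line-list values apart from the A-B entry.

```python
def is_consistent(lines, eps=1e-12):
    """True if energies E exist with |E[upper] - E[lower] - w| <= u for every line.

    lines: iterable of (lower, upper, wavenumber, uncertainty).
    """
    states = {s for lower, upper, _w, _u in lines for s in (lower, upper)}
    # Each line gives two difference constraints:
    #   E[upper] - E[lower] <= w + u   and   E[lower] - E[upper] <= u - w.
    arcs = []
    for lower, upper, w, u in lines:
        arcs.append((lower, upper, w + u))
        arcs.append((upper, lower, u - w))
    # Starting every distance at 0 emulates a virtual source joined to all states.
    dist = dict.fromkeys(states, 0.0)
    for _ in range(len(states)):
        changed = False
        for x, y, wt in arcs:
            if dist[x] + wt < dist[y] - eps:
                dist[y] = dist[x] + wt
                changed = True
        if not changed:
            return True   # all constraints satisfied: a selection exists
    return False          # still relaxing after |V| passes: negative cycle

# Hypothetical line list for Fig. 1 (only the A-B values are quoted in the text).
LINES = [("A", "B", 10.0, 0.0001), ("B", "C", 20.0, 0.0010),
         ("A", "D", 12.0, 0.0041), ("D", "C", 18.0, 0.0100)]
# Same list with the A-B wavenumber raised by 100 cm^-1.
BROKEN = [("A", "B", 110.0, 0.0001)] + LINES[1:]

print(is_consistent(LINES), is_consistent(BROKEN))  # True False
```

Because the selected wavenumber of a line is the energy difference of its endpoints, parallel edges automatically receive the same selected value in this formulation.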
There is a consistent underlying line list behind Fig. 1; Table 2 proves this by showing a particular wavenumber selection. Note that if a line list is consistent, then there is, in practice, an infinite number of consistent wavenumber selections. Therefore, additional preferences must be taken into consideration when selecting wavenumber values. Within the usual, time-proven MARVEL protocol [14,15], for example, the w′_i wavenumbers are determined by a least-squares refinement, that is, by minimizing a sum over i of terms built from the deviations |ŵ_i − w′_i|.
The verification labeling method introduced in this paper requires a consistent line list as its input. Therefore, although the method is demonstrated on a MARVEL-based spectroscopic data set, we disregard how the data were processed by MARVEL, and how the exact MARVEL energies were calculated.

Calculation of the energy values
Hereafter, let us consider the set {(w′_1, u′_1), (w′_2, u′_2), ...}. This set does not contain any parallel edges, and the u′_i uncertainties are either the original u_i uncertainties, or some of them might have been increased to make the line list consistent.
If we have a wavenumber selection W′, then the sum of the wavenumbers w′_i on the edges e_i of a path from the root to another vertex X, with wavenumbers of edges travelled in the reverse direction counted as negative, gives the estimation for the energy value E(X). Note that E(X) is a function of the quantum state X; not to be confused with E, that is, without any variables, which denotes an edge set. The estimated energy value does not depend on the path: any path from the root to X gives the same energy value for E(X).
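The path-sum rule for E(X) can be sketched with a breadth-first traversal from the root. The selection below is hypothetical (only w′_AB = 10 is suggested by the text) and is chosen so that the cycle of Fig. 1 sums to zero; `energies` is an illustrative name.

```python
from collections import deque

# A consistent selection w'_i for Fig. 1: (lower, upper, selected wavenumber).
# Values other than the A-B entry are assumptions.
SELECTION = [("A", "B", 10.0), ("B", "C", 20.0), ("A", "D", 12.0), ("D", "C", 18.0)]

def energies(selection, root):
    """Energy of every state reachable from the root, by summing w' along a path."""
    adj = {}
    for lower, upper, w in selection:
        adj.setdefault(lower, []).append((upper, +w))  # forward: add w'
        adj.setdefault(upper, []).append((lower, -w))  # backward: subtract w'
    energy = {root: 0.0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for nxt, dw in adj.get(v, []):
            if nxt not in energy:
                energy[nxt] = energy[v] + dw
                queue.append(nxt)
    return energy

E = energies(SELECTION, "A")
print(E["B"], E["C"], E["D"])  # 10.0 30.0 12.0
```

State C is reached here via B, yet the route via D gives the same value (12 + 18 = 30), which is the path independence stated above. It holds only because the selection satisfies the zero-sum cycle condition.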
Similar to the uncertainties of the wavenumbers, the calculated quantum-state energies also need to be augmented with well-defined uncertainties. Let P_X ⊆ E be the shortest path from the root to quantum state X using the u′_i uncertainties as edge weights. Let us define the uncertainty of the energy value of X as
\[ U(X) = \sum_{e_i \in P_X} u'_i . \tag{1} \]
Utilizing the wavenumber selection of Table 2, Table 3 shows the energy values and the corresponding uncertainties for the SN of Fig. 1, where the root quantum state is vertex A. Note that the energy value estimation depends on the w′_i values, but the uncertainties are independent of them.

Verification of line lists

Consistency does not imply correctness
If a line list is consistent, one might mistakenly assume that all wavenumber intervals in the line list are correct. This is not true. Let us demonstrate with a simple example that consistency does not imply correctness.
The line list corresponding to Fig. 1 has already been shown to be consistent. There, the wavenumber of the A → B transition is ŵ_AB = 10.0000, with an uncertainty of u_AB = 0.0001. If we assume that they form a correct wavenumber interval, we have w_AB ∈ (10 ± 0.0001).
Let us denote the line list of Fig. 1 by L_orig. Let L_mod be the line list we obtain from L_orig after changing a wavenumber value: let ŵ_AB = 10.01. Observe that we have increased the original ŵ_AB value by a number that is much larger than the corresponding uncertainty: 0.01 > u_AB. However, and this is the source of a lot of problems, the modified line list is still consistent, as proven by the wavenumber selection in Table 4. We obtain a contradiction if we assume that the wavenumber intervals of the consistent L_mod line list are also correct: w_AB ∈ (10 ± 0.0001) and w_AB ∈ (10.01 ± 0.0001) cannot be true simultaneously.
Moreover, as illustrated in Table 5, this error propagates to the energy values. Most notably, observe that the uncertainty of B is U(B) = 0.0001, but the difference between the energy values of B in the two cases is two orders of magnitude larger.
This artificial increase of ŵ_AB is similar to the typical wavenumber error that originates in measurement errors or human typing mistakes. Therefore, it is necessary to develop a mathematical tool that helps to detect, assess, and handle this phenomenon in line-by-line spectroscopic datasets.

Wavenumber error detection
Can we increase or decrease wavenumber values of a consistent line list arbitrarily without losing consistency? Fortunately, for transitions that take part in at least one cycle, the law of energy conservation does provide a bound. To illustrate this, note that we cannot increase ŵ_AB, for example, by 100 (i.e., one hundred cm⁻¹): ŵ_AB = 110.000 would make it impossible to select wavenumber values from the four wavenumber intervals to obtain the zero sum along the cycle.
Therefore, as the law of energy conservation interconnects the transitions of the line list based on cycles, one can use this to predict the maximum artificial increase for each line that does not violate this law. Unfortunately, this cannot be applied to the bridges of the spectroscopic network: here, ŵ_i can be increased by any positive real number without breaking consistency.
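For a network consisting of a single cycle, such as Fig. 1, the bound can be checked directly: a zero-sum selection from the open intervals exists exactly when the absolute signed sum of the listed wavenumbers is smaller than the sum of the uncertainties. The sketch below applies this to the three ŵ_AB values discussed in the text. The wavenumbers of the other three lines and the split of the A-D and D-C uncertainties are assumptions; only their combined effect (a closed cycle and a total uncertainty of 0.0152) matters for the outcome.

```python
def cycle_is_consistent(cycle):
    """cycle: list of (sign, wavenumber, uncertainty); sign is +1 for an edge
    traversed forward (lower -> upper) and -1 for one traversed backward."""
    discrepancy = abs(sum(sign * w for sign, w, _u in cycle))
    budget = sum(u for _sign, _w, u in cycle)
    return discrepancy < budget

def fig1_cycle(w_ab):
    # A -> B -> C is traversed forward, C -> D -> A backward.
    return [(+1, w_ab, 0.0001), (+1, 20.0, 0.0010),
            (-1, 18.0, 0.0100), (-1, 12.0, 0.0041)]

for w_ab in (10.0, 10.01, 110.0):
    print(w_ab, cycle_is_consistent(fig1_cycle(w_ab)))
# 10.0 True
# 10.01 True
# 110.0 False
```

The shift of 0.01 goes unnoticed because it is below the 0.0152 total uncertainty of the cycle, while the shift of 100 is caught.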

The d_i threshold of transitions
First, let us discuss wavenumber errors. For this, let us define the threshold of the ith transition e_i, denoted by d_i, to be the greatest number by which the wavenumber of the ith entry of the line list {(w′_1, u′_1), ..., (w′_m, u′_m)} can be changed such that the line list remains consistent. Since consistency is based on cycles, let us restrict this definition to the non-bridge edges of the SN.
The d_i values can be calculated deterministically with arbitrary accuracy, though with an enormous calculation runtime, by running the wavenumber selection over and over, varying the d_i candidate values. As this route is infeasible, let us define an upper bound d̂_i for each d_i as follows:
\[ \hat{d}_i = \max \{\, x : \{(0, u'_1), \ldots, (u'_i + x, u'_i), \ldots, (0, u'_m)\} \text{ is consistent} \,\} . \tag{2} \]
Observe that d̂_i ≥ d_i by definition. For the ith transition in the line list, its d̂_i value expresses that an arbitrary increase or decrease that is larger than d̂_i will be detected when checking the consistency of the line list.
Note that an arbitrary increase or decrease that is much smaller than d̂_i might also be detected when checking consistency. Capturing these errors depends both on the line list itself and the wavenumber selection function used. Here, the mathematical statement is that errors that are larger than d̂_i will be detected at all times.
This d̂_i value can be calculated efficiently. Let us denote the shortest path between the endpoints of e_i in the graph G(V, E \ {e_i}), using the u′ uncertainties as edge weights, by S(e_i). It can be found, for example, using Dijkstra's algorithm [17] (the name Dijkstra is the reason for the notation d_i). Note that, by definition, it is an edge set: S(e_i) ⊆ E. Then, d̂_i is the length of S(e_i):
\[ \hat{d}_i = \sum_{e_j \in S(e_i)} u'_j . \tag{3} \]
For example, let us consider the estimation of d_AB in Fig. 1 (note that Table 2 already proves the consistency of the underlying line list). Here, we have d̂_AB = 0.0151, because S(e_AB) = {e_AD, e_DC, e_BC}.
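The computation of d̂_i described above amounts to one Dijkstra run on the graph with e_i removed. A minimal sketch follows, with the caveat that the A-D and D-C uncertainties are an assumed split of their combined value of 0.0141 (which is what d̂_AB = 0.0151 and u_BC = 0.0010 imply); the function name `d_hat` is invented here.

```python
import heapq

# (state, state, u') for the four transitions of Fig. 1, in cm^-1.
EDGES = [("A", "B", 0.0001), ("B", "C", 0.0010),
         ("A", "D", 0.0041), ("D", "C", 0.0100)]

def d_hat(edges, i):
    """Length of the shortest path between the endpoints of edge i that avoids
    edge i itself. Returns float('inf') when edge i is a bridge."""
    adj = {}
    for idx, (x, y, u) in enumerate(edges):
        if idx == i:
            continue  # work on G(V, E \ {e_i})
        adj.setdefault(x, []).append((y, u))
        adj.setdefault(y, []).append((x, u))
    source, target, _ = edges[i]
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if v == target:
            return d
        if d > dist[v]:
            continue  # stale heap entry
        for nxt, u in adj.get(v, []):
            nd = d + u
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

print(round(d_hat(EDGES, 0), 4))  # 0.0151
```

Returning infinity for a bridge mirrors the statement that a bridge wavenumber can be changed by any amount without breaking consistency.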
Note that (a) the estimation of d_i depends on the line list; therefore, it is a property inherited from the underlying transitions of the SN, and (b) the d̂_i values can already be used as standalone pieces of information, describing the line list's own capability of detecting incorrect wavenumbers. Moreover, one can also calculate d̂_XY before adding the very first X-Y transition to the line list, offering a priori information.
Now, one can ask what happens when we calculate the uncertainty of an energy value according to the linear formula, Eq. (1). We take a sum of u′_i uncertainties, but each non-bridge edge e_i already has a d̂_i value. The next subsection transfers the concept of the d̂_i values to the quantum states.

The V(X) verification of quantum states
We would like to extend the idea presented in the section "The d_i threshold of transitions" from transition wavenumbers to quantum-state energies. Briefly, if the U(X) uncertainty of quantum state X is calculated by taking the sum of some u′_i uncertainties (see the section "Calculation of the energy values"), then let us use the corresponding d̂_i values to express the vulnerability of U(X) against incorrect wavenumbers in a new V(X) value.
We only have d̂_i values for non-bridge edges; thus, let us restrict the calculation of V(X) to the maximal 2-edge-connected subgraph of the spectroscopic network which contains the root quantum state. Recall that the vertex set of this component is denoted by V′.
Let us define the verification V(X) of quantum state X ∈ V′ as follows:
\[ V(X) = \sum_{e_k \,\in\, P_X \,\cup\, \bigcup_{e_i \in P_X} S(e_i)} u'_k . \tag{4} \]
Briefly, V(X) is equal to U(X) plus the uncertainties along each S(e_i) path, ∀i: e_i ∈ P_X. Observe that we count each u′_i either zero or one time: if ∃ e_i ∈ P_X: e_i ∈ S(e_j), i ≠ j, then e_i appears only once in the sum that defines V(X). The rightmost column of Table 3 shows the verifications of the energy values of Fig. 1.
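The definition of V(X) can be traced on Fig. 1 with the following sketch, which collects the edges of P_X and of every S(e_i) into one set so that each uncertainty is counted once. It is an illustration under assumptions: the A-D and D-C uncertainties are an assumed split of 0.0141, and both function names are invented. For this single-cycle network every state ends up with the sum of all four uncertainties, 0.0152, in line with the later remark that B, C, and D share the same V value.

```python
import heapq

# (state, state, u') for the four transitions of Fig. 1, in cm^-1.
EDGES = [("A", "B", 0.0001), ("B", "C", 0.0010),
         ("A", "D", 0.0041), ("D", "C", 0.0100)]

def shortest_edge_path(edges, source, target, skip=None):
    """Dijkstra on the uncertainties; returns the set of edge indices on a
    shortest path, or None if the target is unreachable. Edge `skip` is ignored."""
    adj = {}
    for idx, (x, y, u) in enumerate(edges):
        if idx == skip:
            continue
        adj.setdefault(x, []).append((y, u, idx))
        adj.setdefault(y, []).append((x, u, idx))
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist[v]:
            continue  # stale heap entry
        for nxt, u, idx in adj.get(v, []):
            if d + u < dist.get(nxt, float("inf")):
                dist[nxt] = d + u
                prev[nxt] = (v, idx)
                heapq.heappush(heap, (d + u, nxt))
    if target not in dist:
        return None
    path, v = set(), target
    while v != source:
        v, idx = prev[v]
        path.add(idx)
    return path

def uncertainty_and_verification(edges, root, state):
    """Return (U(X), V(X)); V(X) is None if P_X contains a bridge."""
    p_x = shortest_edge_path(edges, root, state)
    if p_x is None:
        return None, None
    U = sum(edges[j][2] for j in p_x)
    used = set(p_x)
    for i in p_x:
        x, y, _ = edges[i]
        s_i = shortest_edge_path(edges, x, y, skip=i)
        if s_i is None:
            return U, None  # e_i is a bridge, so no d-hat value exists
        used |= s_i
    V = sum(edges[j][2] for j in used)  # each edge counted at most once
    return U, V

for state in "BCD":
    U, V = uncertainty_and_verification(EDGES, "A", state)
    print(state, round(V, 4))
# B 0.0152
# C 0.0152
# D 0.0152
```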

Small-uncertainty 4-edge-cycle density along P X
We can add a second layer when describing the vulnerability of E(X) by further inspecting its uncertainty-defining shortest path P_X. Intuitively, if each e_i ∈ P_X participates in a large number of 4-edge-cycles (the shortest possible cycle in a bipartite SN) that all have small combined uncertainties, then X is less vulnerable to errors than with only a few cycles, or with cycles formed by edges with large uncertainties.
To capture this phenomenon, let c(e_i) denote the number of 4-edge-cycles in which e_i participates and in which the sum of the other three uncertainties is smaller than 10 · d̂_i. Then, let k(X) = min_{e_i ∈ P_X} c(e_i).
An efficient method to find 4-edge-cycles in a large graph is shown in Ref. [18]. The k(X) value now holds information about the density of small-uncertainty 4-edge-cycles along P_X; thus, it can also be used in the labeling.
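For illustration, c(e_i) and k(X) can be evaluated with a direct enumeration of the 4-edge-cycles through a given edge. The sketch below is a naive enumeration for a graph without parallel edges, not the efficient method of Ref. [18]; the A-D and D-C uncertainties are an assumed split, and the d̂ values are passed in as precomputed numbers (for Fig. 1 each equals the summed uncertainty of the other three edges).

```python
# Uncertainties (cm^-1) of the Fig. 1 network; the A-D / D-C split is assumed.
UNC = {("A", "B"): 0.0001, ("B", "C"): 0.0010,
       ("A", "D"): 0.0041, ("D", "C"): 0.0100}
# d-hat values for the two edges of P_C = {e_AB, e_BC}.
D_HAT = {("A", "B"): 0.0151, ("B", "C"): 0.0142}

def count_small_cycles(unc, edge, d_hat, factor=10.0):
    """c(e): number of 4-edge cycles through `edge` whose other three
    uncertainties sum to less than factor * d_hat."""
    def u(x, y):
        return unc[(x, y)] if (x, y) in unc else unc[(y, x)]
    nbrs = {}
    for x, y in unc:
        nbrs.setdefault(x, set()).add(y)
        nbrs.setdefault(y, set()).add(x)
    a, b = edge
    count = 0
    # A cycle a - b - c - d - a is fixed by choosing c next to b and d next to a.
    for c in nbrs[b] - {a}:
        for d in nbrs[a] - {b, c}:
            if c in nbrs[d] and u(b, c) + u(c, d) + u(d, a) < factor * d_hat:
                count += 1
    return count

def k_value(unc, path_edges, d_hat):
    """k(X): the smallest c(e) over the edges of P_X."""
    return min(count_small_cycles(unc, e, d_hat[e]) for e in path_edges)

k_C = k_value(UNC, [("A", "B"), ("B", "C")], D_HAT)
print(k_C)  # 1
```

In Fig. 1 every edge lies on exactly one 4-edge-cycle, so k(C) = 1. Larger values of k(X) require several independent small-uncertainty cycles along the path.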

Labeling scheme
Based on the section "Verification of line lists", we can construct a labeling scheme of the quantum states, expressing the vulnerability of their energy value and its uncertainty against wavenumber errors occurring in the line list. First, considering the typical range of uncertainty values, given in cm⁻¹, let us introduce labels based on the V(X) verification values according to Table 6. Based on the section "Small-uncertainty 4-edge-cycle density along P X", the second step is to pick a reasonable empirical threshold for k(X), then assign a '+' symbol after the A-F label of the quantum state X if k(X) is greater than this threshold.
Table 6. Labels and the corresponding V(X) magnitudes. Columns: label of X; V(X) magnitude (cm⁻¹).

The MARVEL spectroscopic information system uses Gaussian uncertainty propagation [19] when calculating energy value uncertainties. This formula for the uncertainty of X, where the subscript refers to the squaring of the uncertainties, is
\[ U_2(X) = \sqrt{\sum_{e_i \in P_X} (u'_i)^2} , \tag{5} \]
with the P_X path that minimizes this value. Note that this path may be different from the P_X of the linear U(X) formula. However, because of the monotonicity of the square root function, to find the P_X path for U_2(X) it is enough to run Dijkstra's algorithm using not u′_i but (u′_i)² edge weights. To make the V(X) verification comparable to these values, let us define
\[ V_2(X) = \sqrt{\sum_{e_k \,\in\, P_X \,\cup\, \bigcup_{e_i \in P_X} S(e_i)} (u'_k)^2} , \tag{6} \]
with the P_X path that minimizes U_2(X).
We use the same labeling when using V_2(X) as the one that has been introduced for V(X) in Table 6.
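As a worked comparison of the linear and Gaussian formulas, consider state C of Fig. 1 with root A, using only the values quoted in the text (u′_AB = 0.0001 and the A → B → C path length 0.0011, hence u′_BC = 0.0010):

```latex
\begin{align*}
U(C)   &= u'_{AB} + u'_{BC} = 0.0001 + 0.0010 = 0.0011\ \mathrm{cm^{-1}},\\
U_2(C) &= \sqrt{(u'_{AB})^2 + (u'_{BC})^2}
        = \sqrt{10^{-8} + 10^{-6}} \approx 0.001005\ \mathrm{cm^{-1}}.
\end{align*}
```

The path A → B → C also minimizes the quadratic sum here: the alternative route through D has a combined linear uncertainty of 0.0141, so its sum of squares is at least 0.0141²/2 ≈ 9.9 × 10⁻⁵, far above 1.01 × 10⁻⁶, whatever the individual A-D and D-C uncertainties are.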

A practical example: the W2020 database of water transitions
Here we discuss the verification labels corresponding to an empirical line-by-line database of water transitions, called W2020 [12], developed by two of the present co-authors. In order to get our data in line with the usual conventions of molecular spectroscopy, we opted to use the uncertainty and verification formulas that use Gaussian uncertainty propagation, see Eqs. (5) and (6). We decided to assign a '+' symbol after the A-F labels of quantum state X (see Table 6) if k(X) ≥ 3.
Table 7 shows the distribution of the verification labels corresponding to the H₂¹⁶O entries of the W2020 line list [12]. Out of the 19,282 quantum states that define the line list, 57 are not reachable from the root (they are in what we call 'floating components'), and an additional 2401 nodes can only be reached through at least one bridge. Thus, V_2(X) and U_2(X) values were calculated for 19,282 − 57 − 2401 − 1 = 16,823 quantum states (the root is also omitted).
Note that the uncertainties of the 2401 quantum states which can be reached only through at least one bridge can still be calculated using transition wavenumbers of the line list, i.e., U(X) and U_2(X) both require just the presence of one path. It is just their verification, V(X) and V_2(X), that is not defined, due to the lack of at least two edge-disjoint paths leading to them from the root. To calculate the energy values of the 57 quantum states in the floating components, external sources are required, most notably, effective Hamiltonian (EH) or first-principles energy-level data.
The first observation related to Table 7 is that very few energy levels have the labels A or B. This is understandable, as there are only a relatively small number of very accurate measurements, with uncertainties on the order of a few kHz, in the W2020 database for H₂¹⁶O, and the number of energy levels which participate in a cycle of high-accuracy measurements is even smaller.
Second, the situation of the quantum states with a lower-quality verification label should be addressed. It must be emphasized that an 'F'-labeled state can still have a correct wavenumber interval; it just cannot be verified more accurately using the other transitions in the line list.
The third important observation is that the most frequently occurring accuracy in the W2020 database is ∼10⁻³ cm⁻¹. This is the accuracy of results obtained with the technique of Fourier-transform infrared spectroscopy, used for the largest number of transition measurements.
The quantum states with the smallest and largest V_2(X) values are shown in Tables 8 and 9, respectively. Note that the phenomenon that multiple U_2 and V_2 values are equal might happen quite easily; for example, in Fig. 1, quantum states B, C, and D all have the same V or V_2 value.

Table 7. Distribution of verification labels corresponding to the H₂¹⁶O entries of the W2020 line list. The 'N/A' row represents the quantum states with no label assigned: these quantum states are not in the maximal 2-edge-connected component that contains the root; thus, they are not subject to this labeling (see the section "The V(X) verification of quantum states"). Additionally, the root also received an N/A label. Note that a quantum state X received a '+' in its label if k(X) ≥ 3. First column: Label.
The results in Table 8 show the effectiveness of the Spectroscopic-Network-Assisted Precision Spectroscopy (SNAPS) procedure [19,20], used for the design of measurements yielding line center positions with just a few kHz accuracy in the near-infrared region (in fact, around 7000 cm⁻¹).
Table 10 shows the quantum states with the smallest and largest V_2(X)/U_2(X) ratios. The largest ratios show that not all of the transitions measured via the SNAPS procedure are part of cycles formed by transitions with high (kHz) accuracy. While an effort was made to create cycles when the measurements reported in Ref. [19] were designed, the high cost of these kHz-accuracy line center position measurements prevented obtaining an even larger number of cycles.
Finally, Fig. 2 shows the U_2(X) and V_2(X) values, sorted in the ascending order of the V_2(X) values. It can be seen immediately from this figure that the V_2(X) values are always larger than the corresponding U_2(X) values. Most of the time the U_2(X) value is close to the V_2(X) value, showing that the U_2(X) value is a good approximation of the uncertainty of the empirical energy level. Nevertheless, there are several cases where U_2(X) is much smaller than V_2(X). This typically occurs when a part of a path of very accurate transitions is surrounded by large-uncertainty transitions in the spectroscopic network. The path itself can still produce a small U_2(X) value, but it cannot be verified as accurately, due to the lack of transitions of similarly good uncertainty surrounding the entirety of the path. Thus, these energies might still be accurate, but even a single transition with a wavenumber error may cause a large inaccuracy for them in this line list.

Table 8. Quantum states with the smallest V_2(X) values in the W2020 [12] line list of H₂¹⁶O. Three of the entries have the overall smallest value of 2.24 × 10⁻⁷, then another 9 quantum states have the second smallest V_2(X) value, 6.30 × 10⁻⁷. Columns: quantum state; U_2(X); V_2(X).

Table 9. Quantum states with the largest V_2(X) values in the W2020 [12] line list. A total of 12 quantum states have the largest V_2(X) value, 0.0735119.

Table 10. Quantum states with the largest and smallest V_2(X)/U_2(X) values in the W2020 [12] line list.

Figure 1. Example of a small spectroscopic network (SN), with four quantum states, A, B, C, and D, and four transitions between the states. The blue (outside) numbers represent transition wavenumbers, while the red (inside) numbers represent the corresponding transition uncertainties. As usual in rovibrational spectroscopy, the unit is cm⁻¹.

Figure 2. U_2(X) and V_2(X) values in the W2020 line list, sorted in ascending order of the V_2(X) values.

Table 1. Symbols and terms used in the "Theoretical background" section.

Table 2. A wavenumber selection, proving the consistency of the underlying line list of the spectroscopic network of Fig. 1. Note that in each row u_i ≥ |ŵ_i − w′_i|. Similar to Fig. 1, the unit is cm⁻¹.

Table 4. Wavenumbers ŵ_i and uncertainties u_i of the line list L_mod, and a wavenumber selection (the w′_i values) that proves the consistency of L_mod.